Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add dcgm-exporter daemonset for nvidia #81

Closed
wants to merge 1 commit into from
Closed

Conversation

movence
Copy link
Contributor

@movence movence commented Feb 6, 2024

Description of changes:

  • Add new dcgm-exporter daemonset to support NVIDIA GPU metrics with k8s
  • Add new service that uses dcgm-exporter pods where the traffic is limited within node (no cross nodes communication)

The new dcgm-exporter pod has nodeAffinity config that spins up itself in only GPU nodes. The supported GPU instances are hard coded into values.yaml which can be extended when there are more GPU instances/types supported.

Below is the list of pods in test cluster where 1 GPU and 1 non-gpu nodes are.

NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch-observability-controller-manager-xxxxxx   1/1     Running   0          4m17s
cloudwatch-agent-xxxxxx                                            1/1     Running   0          4m16s
cloudwatch-agent-xxxxxx                                            1/1     Running   0          4m17s
dcgm-exporter-xxxxxx                                               1/1     Running   0          4m18s
fluent-bit-xxxxxx                                                 1/1     Running   0          4m18s
fluent-bit-xxxxxx                                                  1/1     Running   0          4m17s

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

version: v1
spec:
priorityClassName: system-node-critical
serviceAccountName: {{ template "cloudwatch-agent.serviceAccountName" . }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make a different service account for the dcgm exporter instead of using the cloudwatch agent service account? There might be different privileges

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, great call. updated

@@ -35,10 +39,12 @@ containerLogs:
manager:
name:
image:
repository: cloudwatch-agent-operator
# repository: cloudwatch-agent-operator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need to revert to cloudwatch-agent-operator

tag: 1.0.2
repositoryDomainMap:
public: public.ecr.aws/cloudwatch-agent
# public: public.ecr.aws/cloudwatch-agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need to revert this

tag: 1.300031.1b317
# repository: cloudwatch-agent
# tag: 1.300031.1b317
repository: tupperware
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there changes to the agent that we need to make to get this to work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. removed test repo

@@ -137,6 +146,9 @@ agent:
config: # optional config that can be provided to override the defaultConfig
defaultConfig:
{
"agent": {
"debug": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

"metrics_collected": {
"kubernetes": {
"gpu_metrics": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having this as true, can we just have it an empty map in case we want to add more config options in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed. it was a leftover from the previous revision where there were another agent daemonset specifically for GPU nodes.

name:
image:
repository: nvcr.io/nvidia/k8s/dcgm-exporter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming this will be our mirrored repo eventually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

@movence movence closed this Feb 6, 2024
@mitali-salvi mitali-salvi deleted the ci-nvidia-gpu branch July 26, 2024 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants